Display machine information:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 stringi_1.7.6 rlang_1.0.1 cli_3.1.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.11
## [13] tools_3.6.0 stringr_1.4.0 xfun_0.29 yaml_2.2.1
## [17] fastmap_1.1.0 compiler_3.6.0 htmltools_0.5.2 knitr_1.37
## [21] sass_0.4.0
Load database libraries and the tidyverse frontend:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(miceRanger))
Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.
Read following tutorials on the R package miceRanger for imputation: https://github.com/farrellday/miceRanger, https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html.
A more thorough book treatment of the practical imputation strategies is the book Flexible Imputation of Missing Data by Stef van Buuren.
Explain the jargon MCAR, MAR, and MNAR.
MCAR: If the data said to be MCAR, it stands for Missing Completely At Random (MCAR). This means that the missing values of data are unrelated to the data itself. In other words, the probability of being missing is uniform distributed and completely independent both of observable variables and of unobservable variables, and occur entirely at random.
MAR: If the data said to be MAR, it stands for Missing At Random (MAR). This means that the missing values of data are the same only within groups defined by the observed data. In other words, the probability of being missing is not random, but where missingness can be fully accounted for by variables.
MNAR: If the data said to be MNAR, it stands for Missing Not At Random (MNAR). This means that the data data that is neither MAR nor MCAR. MNAR means that the probability of being missing varies for reasons that are unknown to us.
Reference: * Concepts of MCAR, MAR and MNAR * Concepts in incomplete data
MICE imputes missing data in a dataset (it should be dependent variable) through an iterative series of predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables in the dataset. These iterations should be run until it appears that convergence has been met and all specified variables have been imputed.
Step. 1 Replace the missing value with the mean value observed in the data.
Step. 2 The random forest (or linear regression or Predictive Mean Matching) is used to predict each variable with other vairables and the correlation between some variables will improve.
Step. 3 This process is continued until all specified variables have been imputed. The MICE finished and the correlation between some variables will be much closer to original data.
NAs. Replace apparent data entry errors by NAs.icu_stay <- readRDS("../hw3/mimiciv_shiny/icu_cohort.rds")
icu_stay$thirty_day_mort <-
ifelse(is.na(icu_stay$thirty_day_mort) == T, 0, icu_stay$thirty_day_mort)
summary(icu_stay)
## subject_id intime hadm_id
## Min. :10000032 Min. :2110-01-11 10:16:06 Min. :20000094
## 1st Qu.:12483760 1st Qu.:2132-11-03 21:59:53 1st Qu.:22478418
## Median :14991992 Median :2152-09-25 12:29:44 Median :24948788
## Mean :14990135 Mean :2152-10-26 05:18:41 Mean :24976352
## 3rd Qu.:17497349 3rd Qu.:2172-10-29 04:56:08 3rd Qu.:27477545
## Max. :19999987 Max. :2211-05-01 06:59:19 Max. :29999828
##
## stay_id first_careunit last_careunit
## Min. :30000153 Length:53150 Length:53150
## 1st Qu.:32466765 Class :character Class :character
## Median :34987382 Mode :character Mode :character
## Mean :34984425
## 3rd Qu.:37487830
## Max. :39999810
##
## outtime los gender
## Min. :2110-01-12 17:17:47 Min. : 0.00125 Length:53150
## 1st Qu.:2132-11-06 22:32:25 1st Qu.: 1.08479 Class :character
## Median :2152-09-28 07:17:57 Median : 1.87642 Mode :character
## Mean :2152-10-29 13:58:59 Mean : 3.36132
## 3rd Qu.:2172-10-30 21:16:16 3rd Qu.: 3.53770
## Max. :2211-05-10 22:51:06 Max. :110.23228
##
## anchor_age anchor_year anchor_year_group dod
## Min. :18.00 Min. :2109 Length:53150 Min. :2110-01-25
## 1st Qu.:53.00 1st Qu.:2131 Class :character 1st Qu.:2133-04-29
## Median :65.00 Median :2151 Mode :character Median :2153-02-15
## Mean :63.51 Mean :2151 Mean :2153-06-18
## 3rd Qu.:77.00 3rd Qu.:2171 3rd Qu.:2173-09-20
## Max. :91.00 Max. :2207 Max. :2211-01-17
## NA's :45232
## admittime dischtime
## Min. :2110-01-11 10:14:00 Min. :2110-01-15 17:31:00
## 1st Qu.:2132-11-01 01:41:45 1st Qu.:2132-11-11 14:40:00
## Median :2152-09-24 06:45:00 Median :2152-10-03 05:40:00
## Mean :2152-10-25 04:22:49 Mean :2152-11-03 14:27:00
## 3rd Qu.:2172-10-27 23:30:45 3rd Qu.:2172-11-07 08:56:15
## Max. :2211-05-01 06:57:00 Max. :2211-05-12 17:54:00
##
## deathtime admission_type admission_location
## Min. :2110-01-25 09:40:00 Length:53150 Length:53150
## 1st Qu.:2132-07-10 19:30:00 Class :character Class :character
## Median :2152-02-26 23:34:00 Mode :character Mode :character
## Mean :2152-05-25 05:05:24
## 3rd Qu.:2172-05-04 00:07:30
## Max. :2211-01-17 12:34:00
## NA's :47642
## discharge_location insurance language marital_status
## Length:53150 Length:53150 Length:53150 Length:53150
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ethnicity edregtime edouttime
## Length:53150 Min. :2110-01-11 21:42:00 Min. :2110-01-12 00:54:00
## Class :character 1st Qu.:2133-09-01 09:15:00 1st Qu.:2133-09-01 15:55:45
## Mode :character Median :2153-05-28 01:29:30 Median :2153-05-28 06:58:00
## Mean :2153-07-10 17:24:31 Mean :2153-07-10 23:17:28
## 3rd Qu.:2173-07-23 10:55:15 3rd Qu.:2173-07-23 17:29:15
## Max. :2211-05-01 06:07:00 Max. :2211-05-01 09:57:00
## NA's :19618 NA's :19618
## hospital_expire_flag age_hadm hematocrit magnesium
## Min. :0.0000 Min. : 18.00 Min. : 0.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 54.00 1st Qu.:22.60 1st Qu.:1.500
## Median :0.0000 Median : 66.00 Median :26.80 Median :1.700
## Mean :0.1038 Mean : 64.47 Mean :27.61 Mean :1.721
## 3rd Qu.:0.0000 3rd Qu.: 78.00 3rd Qu.:32.10 3rd Qu.:1.900
## Max. :1.0000 Max. :102.00 Max. :61.00 Max. :9.200
## NA's :699 NA's :857
## sodium chloride calcium WBC_count
## Min. : 67.0 Min. : 39.0 Min. : 0.000 Min. : 0.000
## 1st Qu.:131.0 1st Qu.: 94.0 1st Qu.: 7.400 1st Qu.: 4.500
## Median :135.0 Median : 98.0 Median : 7.900 Median : 6.200
## Mean :133.9 Mean : 97.3 Mean : 7.822 Mean : 6.836
## 3rd Qu.:137.0 3rd Qu.:101.0 3rd Qu.: 8.400 3rd Qu.: 8.200
## Max. :173.0 Max. :142.0 Max. :43.000 Max. :471.700
## NA's :683 NA's :690 NA's :1842 NA's :720
## bicarbonate creatinine potassium glucose
## Min. : 2.00 Min. : 0.0000 Min. :0.700 Min. : 0.0
## 1st Qu.:18.00 1st Qu.: 0.5000 1st Qu.:3.200 1st Qu.: 75.0
## Median :21.00 Median : 0.7000 Median :3.500 Median : 87.0
## Mean :20.21 Mean : 0.8932 Mean :3.495 Mean : 89.5
## 3rd Qu.:23.00 3rd Qu.: 1.0000 3rd Qu.:3.800 3rd Qu.: 99.0
## Max. :46.00 Max. :17.9000 Max. :9.400 Max. :1091.0
## NA's :700 NA's :680 NA's :682 NA's :707
## body_temperature systolic_non_invasive_blood_pressure heart_rate
## Min. :-99.90 Min. :-69.00 Min. :-241395.00
## 1st Qu.: 96.70 1st Qu.: 79.00 1st Qu.: 54.00
## Median : 97.40 Median : 90.00 Median : 62.00
## Mean : 96.53 Mean : 89.34 Mean : 57.09
## 3rd Qu.: 97.80 3rd Qu.:100.00 3rd Qu.: 72.00
## Max. :104.00 Max. :246.00 Max. : 150.00
## NA's :820 NA's :512 NA's :1
## mean_non_invasive_blood_pressure respiratory_rate thirty_day_mort
## Min. :-22767.00 Min. : 0.000 Min. :0.00000
## 1st Qu.: 48.00 1st Qu.: 7.000 1st Qu.:0.00000
## Median : 57.00 Median :10.000 Median :0.00000
## Mean : 56.15 Mean : 9.403 Mean :0.04709
## 3rd Qu.: 65.00 3rd Qu.:12.000 3rd Qu.:0.00000
## Max. : 162.00 Max. :37.000 Max. :1.00000
## NA's :533 NA's :37
#From the summary table, we decide to drop the variables with substantial missingness, say >5000 NAs
#drop "deathtime", "edregtime", "edouttime", "dod"
icu_stay <- icu_stay %>%
select(-c("deathtime", "edregtime", "edouttime", "dod"))
#replace lab and vital measurements to NA
for (var in colnames(icu_stay[,24:38])){
icu_stay[[var]][icu_stay[[var]] %in% boxplot.stats(icu_stay[[var]])$out] <- NA
}
colSums(is.na(icu_stay[,24:38]))
## hematocrit magnesium
## 879 1617
## sodium chloride
## 2383 2582
## calcium WBC_count
## 3213 2733
## bicarbonate creatinine
## 2946 4137
## potassium glucose
## 1947 3959
## body_temperature systolic_non_invasive_blood_pressure
## 3515 2284
## heart_rate mean_non_invasive_blood_pressure
## 2849 1845
## respiratory_rate
## 531
miceRanger (request \(m=3\) data sets). This step is computational intensive. Make sure to save the imputation results as a file. Hint: Setting max.depth=10 in the miceRanger function may cut some computing time.seqTime <- system.time(
miceObj <- miceRanger(
icu_stay
, m=3
, returnModels = FALSE
, verbose=FALSE
, max.depth=10
)
)
miceObj
miceObj %>% save(file = "../hw4/miceObj.RData")
completeData(miceObj)[1] %>% write_rds(file = "../hw4/icu_cohort_1.rds")
completeData(miceObj)[2] %>% write_rds(file = "../hw4/icu_cohort_2.rds")
completeData(miceObj)[3] %>% write_rds(file = "../hw4/icu_cohort_3.rds")
plotDistributions(miceObj,vars='allNumeric')
plotDistributions
From the plot, we can see the imputed distributions compared to the original distribution for each variable. The red line is the density of the original, nonmissing data while the little black line are the density of the imputed values in each of imputed datasets. If they are not matched up, then it tells us that it was not Missing Completely at Random (MCAR). Thus, we can tell that only magnesium, calcium, glucose, body_temperture, heart_rate, and respiratory_rate look like MCAR.
plotCorrelations(miceObj,vars='allNumeric')
plotCorrelations
From the plot of Convergence of Correlation, we can visualize how values between datasets converged over the iterations.
plotVarConvergence(miceObj,vars='allNumeric')
plotVarConvergence
From the plot of Center and Dispersion Convergence, we would like to make sure whether dataset converge to the true theoretical mean. It doesn’t look like this dataset had a convergence issue.
plotModelError(miceObj,vars='allNumeric')
plotModelError
From the plot above, it looks like the variables were imputed with a reasonable degree of accuracy. However, for the respiratory_rate and mean_non_invasive_blood_pressure did not do well on the r-squared.
For the correct Multiple Imputation strategy, we have to apply statistical tests in each imputed dataset and then pool the results to obtain summary estimates. For example, the descriptive statistics include pooling means and standard deviations; the difference in mean includes pooling independent t-tests; the relationship between the variables includes pooling logistic, Cox, or linear regression models. Thus, this is not a good idea to use just one imputed data set or to average multiple imputed data sets, but we can analyze each dataset and then pool their result together which is a more correct Multiple Imputation strategy.
Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (glm() function in base R or keras), (2) logistic regression with lasso penalty (glmnet or keras package), (3) random forest (randomForest package), or (4) neural network (keras package).
library(rsample)
icu_cohort_2 <- readRDS("~/biostat-203b-2022-winter/hw4/icu_cohort_2.rds")
#Data split
set.seed(12345)
icu_split <- initial_split(icu_cohort_2$Dataset_2, prop = 0.8
, strata = thirty_day_mort)
icu_train <- icu_split %>% training()
icu_test <- icu_split %>% testing()
rm(icu_cohort_2)
rm(icu_split)
#Logistics model
model1 <- glm(thirty_day_mort ~ ., data = icu_train
, family = binomial(link="logit"))
icu_test$first_careunit = as.factor(icu_test$first_careunit)
icu_test$last_careunit = as.factor(icu_test$last_careunit)
icu_test$gender = as.factor(icu_test$gender)
icu_test$anchor_year_group = as.factor(icu_test$anchor_year_group)
icu_test$admission_type = as.factor(icu_test$admission_type)
icu_test$admission_location = as.factor(icu_test$admission_location)
icu_test$discharge_location = as.factor(icu_test$discharge_location)
icu_test$insurance = as.factor(icu_test$insurance)
icu_test$language = as.factor(icu_test$language)
icu_test$marital_status = as.factor(icu_test$marital_status)
icu_test$ethnicity = as.factor(icu_test$ethnicity)
icu_train$first_careunit = as.factor(icu_train$first_careunit)
icu_train$last_careunit = as.factor(icu_train$last_careunit)
icu_train$gender = as.factor(icu_train$gender)
icu_train$anchor_year_group = as.factor(icu_train$anchor_year_group)
icu_train$admission_type = as.factor(icu_train$admission_type)
icu_train$admission_location = as.factor(icu_train$admission_location)
icu_train$discharge_location = as.factor(icu_train$discharge_location)
icu_train$insurance = as.factor(icu_train$insurance)
icu_train$language = as.factor(icu_train$language)
icu_train$marital_status = as.factor(icu_train$marital_status)
icu_train$ethnicity = as.factor(icu_train$ethnicity)
#Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
fit <- randomForest(as.factor(thirty_day_mort) ~ ., data=icu_train, ntree = 300)
# Probability and class preds
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
res_preds <- predict(model1, icu_test, type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- as.numeric(res_preds > 0.5)
pred2 <- as.factor(pred)
status_test <- as.numeric(icu_test$thirty_day_mort)
a <- confusionMatrix(pred2, as.factor(icu_test$thirty_day_mort))
## confusion matrix
fourfoldplot(a$table, color = c("cyan", "pink"),
conf.level = 0, margin = 1
, main = "Confusion Matrix for Logistics model")
res_preds1 <- predict(fit, icu_test, type = 'response')
status_test <- as.numeric(icu_test$thirty_day_mort) - 1
b <- confusionMatrix(res_preds1, as.factor(icu_test$thirty_day_mort))
## confusion matrix
fourfoldplot(b$table, color = c("cyan", "pink"),
conf.level = 0, margin = 1
, main = "Confusion Matrix for Random Forest")
Accuracy_log <- a$overall[1]
Accuracy_rf <- b$overall[1]
Accuracy_log
## Accuracy
## 0.956444
Accuracy_rf
## Accuracy
## 0.9570085